Airbnb Paris dataset analysis

Datacamp 2021

Alexandre PERBET
Cyril NERIN
Hugo RIALAN
Paul ORLUC
Walid CHRIMNI
Zakaria BEKKR



INTRODUCTION

Airbnb is a community platform that connects travelers with hotel companies, rental property investors, and individuals who rent out all or part of their home. The site provides a search and booking service between hosts offering accommodation and guests looking to rent. It lists more than 1.5 million rental ads in over 34,000 cities and 191 countries. In this study, we restrict ourselves to the city of Paris.

We will use Machine Learning algorithms to predict the price of an Airbnb rental in Paris.

The use case of our work would be:

Data source: http://insideairbnb.com/get-the-data.html

Libraries

Load Data

Exploratory data analysis

The goal of this part is to perform an exploratory data analysis of the dataset in order to get a good overview of it and gain information.

As some parts require accurate information, we also do a little preprocessing to make the exploratory analysis as reliable as possible.

Data dictionary

The data from df_main

We notice that the attribute "neighbourhood_group" is never filled in.

Location of the apartments for rent

Textual variables

column "name"

We can see that the listing names in our dataset are skewed towards upper-class areas of Paris. We can also see that adjectives describing apartment quality, such as cosy, bright and charming, are especially frequent.
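As an illustration of how such word frequencies can be obtained, here is a minimal sketch that counts the most common words in a "name" column. The sample names, the stop-word list and the helper top_words are hypothetical; they are not taken from the original notebook.

```python
import re
from collections import Counter

import pandas as pd

# Hypothetical sample standing in for the "name" column of df_main.
names = pd.Series([
    "Cosy studio near Montmartre",
    "Bright and charming flat in Le Marais",
    "Charming cosy apartment, Marais",
    None,  # listing names can be missing
])

# Illustrative stop-word list (assumption); a real analysis would use a fuller one.
stop_words = {"and", "in", "near", "le"}


def top_words(series, n=5):
    """Return the n most frequent lowercase words, ignoring missing names."""
    counts = Counter()
    for text in series.dropna():
        # [^\W\d_]+ matches runs of letters, accented ones included.
        words = re.findall(r"[^\W\d_]+", text.lower())
        counts.update(w for w in words if w not in stop_words)
    return counts.most_common(n)


top = top_words(names, n=3)
print(top)
```

The same counts could feed a word cloud or a bar chart.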

Column "neighbourhood"

The data from df_listings

Construction of the dataset that will be used for the analysis

Cleaning the data set

Dealing with Missing Values

Columns deletion

Neighbourhood_group

We note that neighbourhood_group has only one unique value, which is NaN. Since this column is never filled in, we delete it.
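A minimal sketch of this check, assuming a pandas DataFrame. The miniature table below is made up for illustration; only the column name neighbourhood_group comes from the text.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the listings table.
df = pd.DataFrame({
    "neighbourhood_group": [np.nan, np.nan, np.nan],
    "neighbourhood": ["Louvre", "Temple", "Opéra"],
    "price": [80, 120, 95],
})

# With dropna=False, NaN counts as a value, so an all-NaN column has 1 unique value.
print(df["neighbourhood_group"].nunique(dropna=False))

# Columns in which every entry is missing carry no information: drop them.
empty_cols = df.columns[df.isna().all()].tolist()
df_clean = df.drop(columns=empty_cols)
print(df_clean.columns.tolist())
```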

License

Moreover, more than half of the values in license are missing. Since this column carries little meaning for our analysis, we chose to delete it.

last_review

In addition, no relevant default date can be specified for the missing values of last_review. We chose to delete this column as well.
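The deletions above can be sketched as follows. The sample values are hypothetical, and the 0.5 threshold simply mirrors the "more than half" criterion applied to license; last_review is dropped explicitly because its removal is a judgment call rather than a threshold rule.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the license value is a placeholder, not a real identifier.
df = pd.DataFrame({
    "license": [np.nan, "placeholder-id", np.nan, np.nan],
    "last_review": ["2021-01-03", np.nan, "2020-11-20", "2021-02-14"],
    "price": [80, 120, 95, 60],
})

# Share of missing values per column (the mean of a boolean mask).
missing_share = df.isna().mean()
print(missing_share)

# Columns with more than half of their values missing.
mostly_missing = missing_share[missing_share > 0.5].index.tolist()

# Drop those columns, plus last_review, for which no sensible default exists.
df_clean = df.drop(columns=mostly_missing + ["last_review"])
```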

Replacement of missing data with default values

As the following output shows, our new, cleaned dataset no longer contains any missing data:
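A minimal sketch of this step, assuming pandas. The columns and default values chosen here (an empty string for name, 0 for reviews_per_month) are illustrative assumptions, not necessarily the defaults used in the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with two missing cells.
df = pd.DataFrame({
    "name": ["Cosy studio", None],
    "reviews_per_month": [1.2, np.nan],
    "price": [80, 120],
})

# Assumed per-column defaults: empty text, and zero reviews per month.
defaults = {"name": "", "reviews_per_month": 0.0}
df_filled = df.fillna(value=defaults)

# Total number of missing cells left in the table.
remaining = int(df_filled.isna().sum().sum())
print(remaining)
```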

Outliers processing

The variable 'price' (target variable)

WARNING: We observe abnormally high prices for the categories "Entire_home/Apt" and "private_room". These outliers on the target variable would prevent us from fitting a good Machine Learning model, so we need to remove them.

NOTE: The average price and the distribution of prices around this average differ greatly depending on the variable room_type.
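Because price distributions differ by room_type, one reasonable option is to compute the outlier threshold separately for each category. The sketch below uses the common 1.5 × IQR rule; this rule, the helper iqr_upper_bound and the sample prices are assumptions made for illustration, since the text does not state which criterion was actually applied.

```python
import pandas as pd

# Hypothetical prices with one extreme value per category.
df = pd.DataFrame({
    "room_type": ["Entire_home/Apt"] * 6 + ["private_room"] * 6,
    "price": [90, 100, 110, 120, 130, 9000, 40, 45, 50, 55, 60, 3000],
})


def iqr_upper_bound(prices, k=1.5):
    """Upper fence of the IQR rule: Q3 + k * (Q3 - Q1)."""
    q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
    return q3 + k * (q3 - q1)


# transform broadcasts each group's fence back to the rows of that group.
upper = df.groupby("room_type")["price"].transform(iqr_upper_bound)

# Keep only the rows at or below the fence of their own room_type.
df_no_outliers = df[df["price"] <= upper]
print(df_no_outliers.groupby("room_type")["price"].max())
```

A single global threshold would either keep the extreme private rooms or cut off legitimate entire homes, which is why the fence is computed per group here.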

Univariate Analysis

Profiling the dataset df

Target variable description

As expected, the most expensive listings are those where the entire home is available. We have little data on the last two categories (shared room and hotel room), so we cannot say much about them.

This plot shows the prices by neighbourhood. With the outliers removed, location does not appear to have a significant impact on price.

Correlation of 'price' with other variables

We compute the coefficient of determination (the squared correlation coefficient) so that the sign of the correlation is ignored.

We note that the variable 'price' is mainly correlated with the variables 'accommodates', 'bedrooms', 'beds' and 'availability_*'. We will therefore restrict the selected variables to these in order to improve the readability of the matrix of coefficients of determination.
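A minimal sketch of this computation, assuming pandas. The sample values and the 0.3 selection threshold are hypothetical; only the column names price, accommodates and bedrooms come from the text, and minimum_nights is added as an assumed weakly related variable.

```python
import pandas as pd

# Hypothetical numeric sample.
df = pd.DataFrame({
    "price": [60, 80, 120, 150, 200, 90],
    "accommodates": [1, 2, 4, 5, 6, 2],
    "bedrooms": [1, 1, 2, 2, 3, 1],
    "minimum_nights": [3, 1, 2, 5, 1, 4],
})

# Squaring the Pearson correlation matrix removes the sign and keeps values in [0, 1].
r2 = df.corr() ** 2

# Squared correlation of every other variable with the target, strongest first.
r2_price = r2["price"].drop("price").sort_values(ascending=False)
print(r2_price)

# Keep only the variables above an (assumed) threshold for a more readable matrix.
threshold = 0.3
selected = r2_price[r2_price >= threshold].index.tolist()
r2_reduced = r2.loc[["price"] + selected, ["price"] + selected]
```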

Feature Engineering

Preprocessing

Train and test data

Evaluation of regression models

Selection of best models

Cross validation

The result is not good. We need to try to improve the "Feature Engineering" part before optimising the best model.
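For reference, a k-fold cross-validation run with scikit-learn can be sketched as below. The synthetic data and the choice of LinearRegression are assumptions made for the example; they stand in for the notebook's actual features and selected models, which are not shown here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the preprocessed features and the price target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 100 + 30 * X[:, 0] + 10 * X[:, 1] + rng.normal(scale=5, size=200)

# 5 shuffled folds; each fold serves once as the validation set.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

# Mean and spread of R² across folds.
print(scores.mean(), scores.std())
```

A low mean score, or a large spread between folds, is the kind of signal that suggests revisiting the features before tuning hyperparameters.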

Model optimisation

Explainability

Interpretability is defined as the ability for a human to understand the reasons for a model’s decision. This criterion has become increasingly important for many reasons:

Conclusion

Summary of the results obtained